About the dataset

In this exploratory analysis we will explore a dataset from the company Prosper, who is part of the peer-to-peer lending industry. This is one of the datasets provided by Udacity : Prosper Loan Data. This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.

# Load the data
pld <- read.csv("C:/Users/dgspa/Desktop/R/prosperLoanData.csv")

Exploratory Analysis

Univariate analysis

I am concentrating my analysis on few of the many variables in the dataset. I am basing my analysis on delinquencies and its correlation in the data. The term “delinquent” commonly refers to a situation where a borrower is late or overdue on a payment, such as income taxes, a mortgage, an automobile loan or a credit card account. There are consequences for being delinquent, depending on the type, duration and cause of the delinquency.

Dimensions of the dataset

# Investigate dataset dimensions
dim(pld)
## [1] 113937     81

Now I setup a new dataframe by keeping only the variables I will consider for the analysis out of the total 81 variables.

Column “ListingCategory” is actually labelled “ListingCategory..numeric.” as its column type (numeric) is mapped to a string of one word descriptions/categories.

# Review column name to fix subset() issue
names(pld[17])
## [1] "ListingCategory..numeric."
# List of variables to keep
Requiredcols <- c("Term","LoanStatus","BorrowerState","ListingCategory..numeric.",
               "CreditScoreRangeLower","CreditScoreRangeUpper",
               "BankcardUtilization","IncomeRange", "TotalProsperLoans",
               "LoanOriginalAmount", "Investors","ListingCreationDate",
               "Occupation","IsBorrowerHomeowner","BorrowerAPR","BorrowerRate",
               "Recommendations","DebtToIncomeRatio", "StatedMonthlyIncome")

# Keep only variables in Requiredcols
prosperloans <- subset(pld, select=Requiredcols)

str(prosperloans)
## 'data.frame':    113937 obs. of  19 variables:
##  $ Term                     : int  36 36 36 36 36 60 36 36 36 36 ...
##  $ LoanStatus               : Factor w/ 12 levels "Cancelled","Chargedoff",..: 3 4 3 4 4 4 4 4 4 4 ...
##  $ BorrowerState            : Factor w/ 52 levels "","AK","AL","AR",..: 7 7 12 12 25 34 18 6 16 16 ...
##  $ ListingCategory..numeric.: int  0 2 0 16 2 1 1 2 7 7 ...
##  $ CreditScoreRangeLower    : int  640 680 480 800 680 740 680 700 820 820 ...
##  $ CreditScoreRangeUpper    : int  659 699 499 819 699 759 699 719 839 839 ...
##  $ BankcardUtilization      : num  0 0.21 NA 0.04 0.81 0.39 0.72 0.13 0.11 0.11 ...
##  $ IncomeRange              : Factor w/ 8 levels "$0","$1-24,999",..: 4 5 7 4 3 3 4 4 4 4 ...
##  $ TotalProsperLoans        : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ LoanOriginalAmount       : int  9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
##  $ Investors                : int  258 1 41 158 20 1 1 1 1 1 ...
##  $ ListingCreationDate      : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
##  $ Occupation               : Factor w/ 68 levels "","Accountant/CPA",..: 37 43 37 52 21 43 50 29 24 24 ...
##  $ IsBorrowerHomeowner      : Factor w/ 2 levels "False","True": 2 1 1 2 2 2 1 1 2 2 ...
##  $ BorrowerAPR              : num  0.165 0.12 0.283 0.125 0.246 ...
##  $ BorrowerRate             : num  0.158 0.092 0.275 0.0974 0.2085 ...
##  $ Recommendations          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DebtToIncomeRatio        : num  0.17 0.18 0.06 0.15 0.26 0.36 0.27 0.24 0.25 0.25 ...
##  $ StatedMonthlyIncome      : num  3083 6125 2083 2875 9583 ...

There are a few categorical features in the new dataset. Example of categorical values in “LoanStatus”

# Review LoanStatus categories
levels(prosperloans$LoanStatus)
##  [1] "Cancelled"              "Chargedoff"            
##  [3] "Completed"              "Current"               
##  [5] "Defaulted"              "FinalPaymentInProgress"
##  [7] "Past Due (>120 days)"   "Past Due (1-15 days)"  
##  [9] "Past Due (16-30 days)"  "Past Due (31-60 days)" 
## [11] "Past Due (61-90 days)"  "Past Due (91-120 days)"

There is a challenge to be considered - If I’m investigating delinquencies, I need to define a definition for it, I need to do two things 1) learn the difference between Chargedoff and Defaulted and 2) assign a cutoff line for delinquents, i.e if someone is 1-15 days late on payment is that delinquent? and what if it is 61-90 days late?

Chargedoff:

A charge-off or chargeoff is the declaration by a creditor (usually a credit card account) that an amount of debt is unlikely to be collected. This occurs when a consumer becomes severely delinquent on a debt. Traditionally, creditors will make this declaration at the point of six months without payment. In the United States, Federal regulations require creditors to charge-off installment loans after 120 days of delinquency, while revolving credit accounts must be charged-off after 180 days

Defaulted:

In finance, default is failure to meet the legal obligations (or conditions) of a loan,[1] for example when a home buyer fails to make a mortgage payment, or when a corporation or government fails to pay a bond which has reached maturity.

Defining Delinquent Borrowers

# New variable to be used to identify "delinquent" borrowers
prosperloans$DelinquentBorrowers <- ifelse(
                                      prosperloans$LoanStatus == "Defaulted" |
                                      prosperloans$LoanStatus == "Chargedoff" |
                                      prosperloans$LoanStatus == "Past Due (61-90 days)" |
                                      prosperloans$LoanStatus == "Past Due (91-120 days)" |
                                      prosperloans$LoanStatus == "Past Due (>120 days)", 
                                      1, 0)

The creation date distribution of observations in the dataset could cause bias in further investigation, for example, if the majority of the data was observed during 2008-2009 (the financial crisis) this could skew the data towards having a majority of delinquency. To analy this, I’m going to explore the ListingCreationDate feature along with LoanStatus.

Loan Creation Year Distribution

# Create new variable 
prosperloans$LoanCreationYear <- format(
                                      as.Date(prosperloans$ListingCreationDate), 
                                      "%Y")

# Investigate distribution of loan creation year
ggplot(prosperloans, aes(prosperloans$LoanCreationYear, 
                         fill = factor(DelinquentBorrowers))) + 
  geom_bar() +
  scale_fill_discrete(labels = c("Good Standing", "Delinquent")) +
  theme(legend.title=element_blank()) +
  theme_minimal()

Term and Loan Status

# Summary of loan terms
summary(prosperloans$Term)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   36.00   36.00   40.83   36.00   60.00
# Plot defaulted terms and their loan status distribution
ggplot(prosperloans,aes(Term, fill = factor(DelinquentBorrowers))) + 
  geom_bar(width = 12) +
  scale_x_continuous(breaks = c(0,12,36,60)) + 
  scale_fill_discrete(labels = c("Good Standing", "Delinquent")) +
  theme(legend.title=element_blank()) +
  theme_minimal()

From the summary breakdown, the average term appears to be 36 months (3 yrs) with an original loan request being $8337, ranging from $1000 to $35,000. This seems to cover the full available range allowed by Prosper.

It’s also very clear that the majority of borrowers choose a 36 month term, where 60 and then 12 month terms are 2nd and 3rd most popular, respectively. And approximately 15% of the total borrowers are delinquent on their loans. Thinking of the distribution of delinquency arcoss terms, one would imagine larger term equates to a larger loan amount and monthly payments which may be too much to handle for some. This inturn would contribute to having more delinquncies. However, the distribution of delinquency is surprisingly very similar to that of the total loan distribution across loan terms. An investigation around loan amounts and across terms may reveal more.

Original Loan Amount

# Summary of loan terms
summary(prosperloans$LoanOriginalAmount)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000
# Plot loan terms
g <- ggplot(prosperloans,aes(LoanOriginalAmount, 
                             fill = factor(DelinquentBorrowers))) +
  theme_minimal() +
  scale_fill_discrete(labels = c("Good Standing", "Delinquent")) +
  theme(legend.title=element_blank()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

g + geom_histogram(binwidth = 500)  

Summary data indicates mean and median are quite far apart. This is visualized in the graph which is right skewed. As the spikes of borrowers at $10,000, $15,000 and $25,000 are pulling the mean higher.

# Get details with reduced binwidth and added breaks
g + geom_histogram(binwidth = 100) + 
    scale_x_continuous(
      limits = c(500, quantile(prosperloans$LoanOriginalAmount, 0.95)),
      breaks = c(0, 2500, 5000, 7500, 10000, 12500, 15000, 17500, 20000))
## Warning: Removed 5285 rows containing non-finite values (stat_bin).

Taking a closer look at 95% of the data, specifically under $20,000, you can see multiple spikes at these numbers - 1000, 2000,…,8000, 9000, 10000 - dollars. Since most debt doesn’t come in nice round numbers, we can assume borrowers are probably rounding up instead to requesting exact dollar values for their loans. From the data provided, I don’t believe there is in specific way to determine if this assumption is true, however, given that debt / money is a floating value with interest rate applied as floating value, it is quite unlikely that the majority of borrowers have debt divisible by 5.

Income Range

# Plot out income range of all borrowers
positions <- c("Not employed", "$0", "Not displayed", 
                "$1-24,999", "$25,000-49,999", "$50,000-74,999",
                "$75,000-99,999", "$100,000+")

ggplot(prosperloans, aes(IncomeRange)) + 
  scale_x_discrete(limits = positions) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  geom_bar()

Reviewing income range allows us to see the distribution of total borrowers and their income with a $25k dollar bin / range. This could shine light into the fincanical standing of borrowers which may be contributing factors to delinqunency.

The distribution of borrowers across income range is somewhat normal if we consider that there are a number of users in the “Not Displayed” category, which was re-odered with the assumption that this category of borrowers do indeed have an income.

Borrowers State

ggplot(na.omit(prosperloans),
       aes(BorrowerState, fill = factor(DelinquentBorrowers))) +
  geom_bar(position = "dodge") +
  scale_y_log10() +
  theme_minimal() +
  scale_fill_discrete(labels = c("Good Standing", "Delinquent")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1), 
        legend.title=element_blank(), legend.position = "top")

# Map the 2 letter abbrievations to their full name for map plot
prosperloans$BorrowerStateFullName <- tolower(
  state.name[match(prosperloans$BorrowerState, state.abb)])

us <- map_data("state")
states <- data.frame(state.center, state.abb)

# Group by stateName and delinquent borrowers
grp_by_state <- prosperloans %>%
  group_by(BorrowerStateFullName, BorrowerState, DelinquentBorrowers) %>%
  summarise(count = n()) %>%
  filter(DelinquentBorrowers == 1)
## Warning: package 'bindrcpp' was built under R version 3.4.3
# Plot US map with delinquent borrower data
ggplot() + 
  geom_map(data = us, map = us,
            aes(x = long, y = lat, map_id = region, label = region),
            fill="#ffffff", color="#ffffff", size=0.15) +
  geom_map(data = na.omit(grp_by_state), map=us,
            aes(fill = count, map_id = BorrowerStateFullName),
            color = "#ffffff", size = 0.15) +
  geom_text(data = states, 
            aes(x = x, y = y, label = state.abb, group = NULL), size = 2) +
  scale_fill_continuous(low='lightgray', high='black', guide='colorbar') +
  labs(x="Delinquent Borrowers by State", y=NULL) +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  theme(panel.border = element_blank(),
        panel.background = element_blank(),
        axis.ticks = element_blank(),
        axis.text = element_blank())
## Warning: Ignoring unknown aesthetics: x, y, label

Borrowers state covers 52 states, this includes Alaska and Hawaii. Representation for California is skewing the graph, unable to distinguish smaller state counts and differences among states. Given California is the most populous state in the US, having a much higher count in borrowers is probably not a good indicator of a trend. To fix this and to better view any potential trends, a log10 transformation was applied to the y axis.

California has the highest count of delinquents but it should be taken into consideration that this state is also the more populous. While Florida, Texas and the East coast follow suite. Noticably, “middle America” has the lowest levels of delinquent borrowers but this could also be due to lower levels in population, technical savvy, lower cost of living resulting in better financial situations not requiring loans, etc. Two other unexpected results from the plot shows 1) there are no states with more delinquent borrowers than those in good standing and 2) the states of ND and WY have negliable numbers of delinquent borrowers.

Listing Category

# Map numbers to category
listing_category <- c("Not Available", "Debt Consolidation", "Home Improvement", 
                      "Business", "Personal Loan", "Student Use", "Auto", "Other", 
                      "Baby&Adoption", "Boat", "Cosmetic Procedure", 
                      "Engagement Ring", "Green Loans","Household Expenses", 
                      "Large Purchases", "Medical/Dental", "Motorcycle",
                      "RV", "Taxes", "Vacation", "Wedding Loans")

# Create new variable for mapped names (remember vector is not 0 based indexed, starts at 1)
prosperloans$ListingCategoryFullName <- listing_category[
                                          (prosperloans$ListingCategory..numeric.)+1]

g <- ggplot(prosperloans, aes(prosperloans$ListingCategoryFullName, 
                              fill = factor(DelinquentBorrowers))) +
  geom_bar(position="dodge") +
  theme_minimal() +
  scale_fill_discrete(labels = c("Good Standing", "Delinquent")) +
  theme(legend.title=element_blank()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

# Log file will help with comparison across all categories
g + scale_y_log10()

The listing category values represent the purpose or reason why the borrower is requesting a loan. Based on the above plot, although “Debt Consolidation” is by far the most common reason borrowers require a loan, this is misleading. Debt consolidation removes the specificity of where the debt came from, i.e. Auto, Student Use, Taxes could have equally contributed to someone’s debt but the “reason” the borrower is requesting a loan is for consolidation and therefore they flag “Debt Consolidation” as the purpose. One thing we can take away from the high volume of borrowers requesting debt consolidation loans, is that, many borrowers have debt from multiple sources.

Bivariate analysis

Continuing to look for trends in delinquencies, I will investigate possible expected and unexpected relationships between features with tools such as the scatterplot and boxplot.

Listing Cateogry and Loan Amount

ggplot(prosperloans, aes(ListingCategoryFullName, 
                         LoanOriginalAmount, 
                         group = ListingCategoryFullName)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

ggplot(prosperloans, aes(ListingCategoryFullName, 
                         BorrowerRate, 
                         group = ListingCategoryFullName)) +
  geom_boxplot() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

In the above plots, I examine the loan category, loan amounts and borrowing rate. Debt consolidation as expected has the highest loan amounts across categories; however, unexpectedly “Baby and Adoption” category is also the highest (tied) loan amount across categories. It would be interesting to investigate the total number of borrowers across these two categories, one would expect debt consolidation to have a large distribution of borrowers.

Also of note, the “Student Use” category loan amount it quite low. Given the well documented high levels of student debt, we can assume students didn’t suddenly stop having debt but instead don’t use Prosper to manage their debt or this category covers items other than student tutition such as lunches, dinners and books.

The borrower’s interest rate doesn’t seem to have any particular surprises. The mean borrowing rate across all categories tend to vary between 0.15 and 0.25.

Loan Term and Loan Amount

# look into outliers and discrete x axis on correct terms
ggplot(prosperloans, aes(Term, LoanOriginalAmount, group = Term)) +
  geom_boxplot() +
  scale_x_continuous(breaks = c(0,12,36,60)) +
  theme_minimal()

In this plot, I examin the three (3) term loans and borrower’s loan amounts. Surprisingly the 60 month term has a mean value much higher than that of the 36 month term loan. This considering our previous plot where only approx. 25k borrowers used 60 month term loans in comparison to approx. 75k borrowers who used 36 month terms.

Loan and Bankcard Utilization

# laon, bankcard and loan status (factor)
ggplot(prosperloans, aes(LoanOriginalAmount, BankcardUtilization)) +
  geom_point(alpha = 0.1, size = 0.75, position = "jitter") +
  theme_minimal() +
  scale_y_log10()
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 7604 rows containing missing values (geom_point).

This plot was quite surprising as I was expecting a different/stronger relationship where loan amounts would rise as bankcard utilization lowered. Instead, we see a high volume cluster of loans under $10,000 with borrowers close to 100% bankcard utilization.

Debt-to-Income Ratio

# monthly income vs debt to income ratio, highlighting delinquent borrowers
ggplot(prosperloans, aes(StatedMonthlyIncome, 
                         DebtToIncomeRatio, 
                         colour = factor(DelinquentBorrowers))) +
  geom_point(alpha = 0.1) +
  theme_minimal() +
  scale_x_continuous(limits = c(0,20000)) +
  scale_y_continuous(limits = c(0, 1)) +
  theme(legend.title=element_blank()) +
  scale_color_discrete(labels = c("Good Standing", "Delinquent")) +
  geom_density2d()
## Warning: Removed 10388 rows containing non-finite values (stat_density2d).
## Warning: Removed 10388 rows containing missing values (geom_point).

In comparing debt-to-income ratio with a borrower’s stated monthly income I was expecting to see a somewhat obvious trend that delinquent borrowers would have a lower monthly income and a higher debt-to-income ratio. With overplotting in the scatterplot two additional techniques were used to more clearly reveal any trends or unexpected results.

Borrwers with stated incomes over $20k and debt-to-income ratios over 1 were considered as outliers and removed from the plot.

The density contour lines show a high concentration of delinquent borrowers earn less than $2500 a month but have a low debt-to-income ratio of under 0.50 (or 50%). The plot also suggests a negative correlation between monthly income and debt-to-income ratio, i.e the more a borrower makes in monthly income the lower their debt-to-income ratio; However, this does not guarantee the loan will not go into delinquency.

After further review from multiple online resources, it seems that a typical “good” debt-to-income ratio is under approx. 36%.

Multivariate Analysis

In this section, I’ll dive into examining the relationships between multiple variables of the dataset.

First, a correlation matrix will be used to calculate the coefficients in order to help start the investigation process where by the variables with strong and/or weak relationships will be reviewed.

Correlation Matrix

# Pearson correlation coefficients, using pairwise observations (default method)
# Non-numeric columns automatically removed/ignored
ggcorr(prosperloans, label = TRUE, label_size = 3, 
       hjust = 0.8, size = 2.5, color = "black", layout.exp = 2)
## Warning in ggcorr(prosperloans, label = TRUE, label_size = 3, hjust =
## 0.8, : data in column(s) 'LoanStatus', 'BorrowerState', 'IncomeRange',
## 'ListingCreationDate', 'Occupation', 'IsBorrowerHomeowner',
## 'LoanCreationYear', 'BorrowerStateFullName', 'ListingCategoryFullName' are
## not numeric and were ignored

The correlation matrix revealed a few surprising things - I thought there would be a much stronger relationship between interest rate (BorrowerRate) and the credit score (CreditScoreRangeUpper/Lower). At a score of -0.5 it’s the strongest correlation out of the selected variables. Also, the number of investors (Investors) and the borrower loan amount (LoanOriginalAmount) has a positive correlation of 0.4, which was somewhat expected as investors are only allowed to contribute a portion of a loan amount, i.e. no one investor can lend an entire loan amount.

Credit Score and Borrowing Rate

ggplot(prosperloans, aes(CreditScoreRangeUpper, BorrowerRate,
                               color = factor(DelinquentBorrowers))) +
  scale_x_continuous(limits = c(400, 1000)) +
  scale_y_continuous(limits = c(0, 0.4)) +
  geom_point(alpha = 0.1, position = "jitter") +
  scale_color_discrete(labels = c("Good Standing", "Delinquent")) +
  theme_minimal() +
  theme(legend.title=element_blank())
## Warning: Removed 728 rows containing missing values (geom_point).

We see an expected result here, where the trend is lower credit scores tend to have higher borrowing rates and higher credit scores obtaining a lower borrowing rate. Delinquent borrowers also tend to have lower credit scores, this could be in part due to their higher interest rates on loans hendering payments and causing delinquencies. If lower credit scores mean higher borrowing rates, and higher rates tend to cause delinquency, how does anyone with a low credit score get out of the cycle? What’s the cause and effect? More investigation would be required to come to a conclusion. And also as expected, borrowers with higher credit scores and lower interest rate, tend to be in good standing with their loans.

Credit Score and Loan Amount

Summary: Delinquent Borrowers

summary(subset(prosperloans$CreditScoreRangeUpper, 
               prosperloans$DelinquentBorrowers == 1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0   619.0   679.0   661.5   719.0   879.0     174

Summary: Good Standing

summary(subset(prosperloans$CreditScoreRangeUpper, 
               prosperloans$DelinquentBorrowers == 0))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    19.0   679.0   719.0   712.4   739.0   899.0     417
g1 <- ggplot(subset(prosperloans, DelinquentBorrowers == 1), 
             aes(CreditScoreRangeUpper, LoanOriginalAmount)) +
  ggtitle("Delinquent Borrowers") +
  geom_point(alpha = 0.1) +
  scale_x_continuous(limits = c(400, 1000)) +
  scale_y_continuous(limits = c(0, 35000)) +
  theme_minimal()

g2 <- ggplot(subset(prosperloans, DelinquentBorrowers == 0), 
             aes(CreditScoreRangeUpper, LoanOriginalAmount)) +
  ggtitle("Good Standing Borrowers") +
  geom_point(alpha = 0.1) +
  scale_x_continuous(limits = c(400, 1000)) +
  scale_y_continuous(limits = c(1000, 35000)) +
  theme_minimal()

grid.arrange(g1, g2, nrow = 1)
## Warning: Removed 269 rows containing missing values (geom_point).
## Warning: Removed 456 rows containing missing values (geom_point).

In the above comparisons, both scatterplots show a borrower’s credit score against their loan amounts. Once again, this plot reveals some expected and unexpected results.

The higher concentration of delinquent borrowers have lower credit scores, and they also tend to borrow less money, under 10,000. This maybe due to our previous plot which indicates, lower credit scores often tend to have higher interest rates on their loans - this was an expected result. What was a bit unexpected was the lower loan amount for borrowers with much higher credit scores. Based on previous plots, higher scores provide lower rates, which would tend to allow the borrower to gain access to a higher loan amount. And although the average loan amount for loans in good standing are higher, there is still a high concentration of loans under $20,000 and with credit scores over 700.

Loan Amount and Investors

# Plot loans and investors
ggplot(prosperloans, aes(LoanOriginalAmount, Investors)) +
  geom_point(alpha = 0.1, size = 0.75, position = "jitter") +
  scale_x_log10() +
  theme_minimal()

Based on the correlation matrix, a positive relationship was expected. However, the non-linear, somewhat expotential curve of investors to loan amount was unexpected. This plot also shows a wide variance from the centred curve line of the corelation. Of note, Proper doesn’t allow investors to contribute more than 10% of their net worth to any one loan. This could be the underlying factor contributing to the positive correlation between loan amount and number of investors.

Bankcard Utilization and Borrowing Rate

# Loan rates and bankcard utilization as a percentage 
ggplot(prosperloans, aes(BorrowerRate, BankcardUtilization)) + 
  geom_point(alpha = 0.1, size = 0.75, position = "jitter") +
  scale_y_continuous(limits = c(0, 1), labels = scales::percent) +
  facet_grid(DelinquentBorrowers ~ .) +
  theme_minimal()
## Warning: Removed 13186 rows containing missing values (geom_point).

Revisiting bankcard utilization, this time with borrowing rate, it was unexpected to see bankcard utilization over 1 or 100% and how the typical calculation happens doesn’t seem to allow a utilization of over 100%. Another unexpected result, was the large number of borrowers with high bankcard utilization who were in good standing. And although there are a larger number of borrowers in good standing, in comparison to delinquent borrowers, good standing borrowers have a utilization which is more evenly spread through 0 to 100%.

Loan Amount and Year across Income Range

levels(prosperloans$IncomeRange)

# Examin income range and loan amounts across years 
prosperloans <- transform(prosperloans, 
               IncomeRange = factor(IncomeRange, 
                                    levels = positions, 
                                    labels = positions))

ggplot(prosperloans, aes(LoanCreationYear, LoanOriginalAmount)) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(0,20000)) +
  facet_grid(IncomeRange ~ .) +
  theme_minimal()
## Warning: Removed 5285 rows containing non-finite values (stat_boxplot).

These were quite interesting plots, as all income ranges display the same curve (distribution across 2007 - 2014). Again, this could be a result of loans, investors and the financial crisis as mentioned previously. Also, the cateogry “not displayed” is missing data after 2008, and one can assume that possibly this category was removed as an option for users after 2008, forcing users to provide more transparency - which also could have been a side effect of the crisis. However, only in 2007 is there data for any of the other categories, which would imply that during 2005 and 2006, none of the income range categories existed (or the data was lost/not recorded).

# Comparison of delinquent and good standing borrowers
ggplot(prosperloans, aes(LoanCreationYear, 
                         LoanOriginalAmount, 
                         fill = factor(DelinquentBorrowers))) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(0,20000)) +
  facet_grid(IncomeRange ~ .) +
  scale_fill_discrete(name = "Borrowers", breaks = c(0,1), 
                      labels = c("Good Standing", "Delinquent")) +
  theme_minimal()
## Warning: Removed 5285 rows containing non-finite values (stat_boxplot).

With borrowers whose income range is over $100,00, Delinquency for this income range tends to tread higher then good standing accounts, especially with loans created prior to 2008. These delinquent loans tend to have a higher min, max and mean dollar amount. This larger delinquency gap can also be seen with borrowers of the $0 income range.

Final Plots and Summary

# Plot 1
# Borrowers across listing categories
# Log scale will help with comparison across all categories
ggplot(prosperloans, aes(prosperloans$ListingCategoryFullName, 
                         fill = factor(DelinquentBorrowers))) +
  geom_bar(position="dodge") +
  scale_y_log10() +
  theme_minimal() +
  scale_fill_discrete(labels = c("Good Standing", "Delinquent")) +
  labs(x="Loan Category", y="Number of Borrowers", 
       title = "Borrowers per Loan Category and Status") +
  theme(legend.title=element_blank()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

cat_filtered_data <- subset(prosperloans, 
                            prosperloans$ListingCategoryFullName != "Debt Consolidation")

ggplot(cat_filtered_data, aes(cat_filtered_data$ListingCategoryFullName, 
                              fill = factor(DelinquentBorrowers))) +
  geom_bar(position="dodge") +
  theme_minimal() +
  scale_fill_discrete(labels = c("Good Standing", "Delinquent")) +
  labs(x="Loan Category", y="Number of Borrowers", 
       title = "Borrowers per Loan Category and Status") +
  theme(legend.title=element_blank()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

In the above plots we examine the different categories of loans (reason borrower’s need the loan) and the total number of borrowers for that type of loan.

The initial plot showed “debt consolidation” as an extreme outlier and therefore the a scale log10 of the y axis was required in order to better visualize and compare all categories.

The debt consolidation category is somewhat misleading as multiple categories could be lumped into one - which is exactly what it is by definition. However, I believe it doesn’t allow for enough details when using this variable for trend or relationship analysis and the category is probably better left ignored during those types of investigations. But one thing we can take away from the high volume of borrowers requesting debt consolidation loans, is that, many borrowers have debt from multiple sources and is by far the most frequent category.

When reviewing the plot under log10 y axis, the low delinquency of boat loans and the lack of any delinquent borrowers for RV loans was quite interesting although the actual number of borrwers is specially low with 52 and 85, respectively. To help clarify the comparison, a 2nd plot without the “debt consolidation” category or the log10 y scale transfrom was plotted. Here, we can see the best loan categories would be “Other”, “Home Improvement” or “Business” and not “boat” or even “RV”.

# Plot 2
# monthly income vs debt to income ratio, highlighting delinquent borrowers
ggplot(prosperloans, aes(StatedMonthlyIncome, 
                         DebtToIncomeRatio, 
                         colour = factor(DelinquentBorrowers))) +
  geom_point(alpha = 0.1) +
  theme_minimal() +
  scale_color_discrete(labels = c("Good Standing", "Delinquent")) +
  scale_x_continuous(limits = c(0,20000), labels = scales::dollar) +
  scale_y_continuous(limits = c(0, 1)) +
  theme(legend.title=element_blank()) +
  labs(x="Monthly Income", y="Debt-to-Income Ratio", 
       title = "Monthly Income to Debt-to-Income Ratio by Status") +
  geom_density2d()
## Warning: Removed 10388 rows containing non-finite values (stat_density2d).
## Warning: Removed 10388 rows containing missing values (geom_point).

In the above plot we examine the stated monthly incomes of borrowers and their debt-to-income ratio. This data was then visually categorized by delinquent and good standing loans.

To mitigate overplotting, density contour and other layers were applied to the plot. These techniques help to reveal trends and unexpected results.

The density contour lines show a high concentration of delinquent borrowers earn less than $2500 a month but have a relatively low debt-to-income ratio of under 0.50 (or 50%). The plot also suggests a negative correlation between monthly income and debt-to-income ratio, i.e the more a borrower makes in monthly income the lower their debt-to-income ratio; However, this does not guarantee the borrower will not go into delinquency.

# Plot 3
# Income Range:

levels(prosperloans$IncomeRange)
## [1] "Not employed"   "$0"             "Not displayed"  "$1-24,999"     
## [5] "$25,000-49,999" "$50,000-74,999" "$75,000-99,999" "$100,000+"
# Comparison of delinquent and good standing borrowers
thousands <- 1000
ggplot(subset(prosperloans, prosperloans$IncomeRange != "Not displayed"), 
       aes(LoanCreationYear, 
           LoanOriginalAmount/thousands, 
           fill = factor(DelinquentBorrowers))) + 
  geom_boxplot() +
  scale_y_continuous(limits = c(0,20), labels = scales::dollar) +
  facet_grid(IncomeRange ~ .) +
  scale_fill_discrete(name = "Borrowers", breaks = c(0,1), 
                      labels = c("Good Standing", "Delinquent")) +
  labs(x="Year (of loan creation)", y="Loan Amount (in dollars 1000's)", 
       title = "Loan Amount by Income Range and Status across Years") +
  theme_minimal()
## Warning: Removed 5073 rows containing non-finite values (stat_boxplot).

In the above plot we examine the borrower loan amount by their income range across loan creation year. The data was then separated and visualized by borrowers in good standing and delinquent.

Category data for “Not Displayed” was removed from the plot. This particular category is missing all data after 2008, and one can assume this category was removed as an option for users after that date, forcing users to provide more transparency. And only in 2007 is there data for any of the other categories, which would imply that during 2005 and 2006, none of the income range categories existed (or the data was lost/not recorded). With this knowledge, I believe ignoring the “Not Displayed” category data would not adversely affect any analysis.

Overall, the plot reveals and confirms some trends by providing a different perspective on the data, such as the distribution across years which “slumps” into a valley type curve in years 2008, 2009 and 2010. This happens across all income ranges and borrowers with good standing and delinquent loans. There were also strange unexpected results specifically, with borrowers whose income range is over $100,000. Delinquency for these borrowers tend to tread higher then good standing accounts, especially with loans created prior to 2008. These delinquent loans also have a higher min, max and mean values. This larger delinquency gap can also be seen with borrowers of the $0 income range, although I’m not sure of any possible connection.

Reflection

The prospser loans dataset contains 113937 observations with 81 variables spanning across 20. Understanding the variables, terminology and general domain knowledge of financial peer-to-peer lending was the first obstacle in approaching this dataset. However, one hurdle was dertermining which variables to analyze, not drifting too far off any one path of investigation and not pulling in new variables throughout the process. Another persistent issue was overplotting on scatterplots, a number of techniques were used across multiple plots.

The general analysis revealed areas of interests such as positive corelation between credit score and borrowing rate which brough up any curious questions concerning delinquent borrowers and the perplexity of cause and effect. Also, trends were confirmed and unexpected, unknown relationships such as those between loan amount and the number of investors contributing to that loan, were revealed.

Also,there were factors discovered which need to be further investigated. I believe additional time in multivariate analysis on variables such as occupation, income range, loan category and delinquencies would expose more trends and perhaps allow for predictions among borrowers (those who will or will not go into delinquency).

Additional data would also enhance this dataset. Having the borrower’s age and sex would allow analysis to possibly discover trends among men and women or young and old. Also, population and state-average-income features, would allow for discovery of the type of environment the borrower lived in.

References

https://en.wikipedia.org/wiki/Charge-off https://en.wikipedia.org/wiki/Default_(finance) https://www.prosper.com/help/topics/how-to-read-a-loan-listing http://t-redactyl.io/blog/2016/05/creating-plots-in-r-using-ggplot2-part-11-linear-regression-plots.html https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf http://r4ds.had.co.nz/

Programming Environment sessionInfo()